checkstate: make tests more robust #249

flotter · 2023-06-27T11:23:00Z

checkstate: make check runners exit cleanly

The checks manager uses contexts to cancel check runners. This happens
during checkstate.PlanChanged() when the previous checkers are all
terminated, and the new checker definition is applied.

However, context cancellation is asynchronous (as highlighted in the
context documentation), so the check runners will continue to complete
even after the context cancellations returned.

Due to this race condition, the runner can still launch a check even after
the context cancellation is complete.

This problem is reproduced here:
https://gist.github.com/flotter/f6bf32d7cc33cb1d53a3ddc39fc3b64c

The result of this is easily seen the following unit test failure:

panic: internal error: reaper must be started

goroutine 75 [running]:
github.com/canonical/pebble/internals/reaper.StartCommand
/home/flotter/pebble/internals/reaper/reaper.go:164
github.com/canonical/pebble/internals/overlord/checkstate.(*execChecker).check()
/home/flotter/pebble/internals/overlord/checkstate/checkers.go:170
github.com/canonical/pebble/internals/overlord/checkstate.(*checkData).runCheck()
/home/flotter/pebble/internals/overlord/checkstate/manager.go:199
github.com/canonical/pebble/internals/overlord/checkstate.(*checkData).loop()
/home/flotter/pebble/internals/overlord/checkstate/manager.go:182
created by github.com/canonical/pebble/internals/overlord/checkstate.(*CheckManager).PlanChanged
/home/flotter/pebble/internals/overlord/checkstate/manager.go:74

In this particular case, the problem is that test teardown stops the reaper
process which is responsible for reaping check commands, while the final
(edge case) run of the check is still completing.

Following the same logic, this could also cause check overlap between terminating
previous checks and starting new checks in checkstate.PlanChanged().

The following changes fix the problem:

Use a WaitGroup to represent the group of active checks in the check manager
Wait() on the group to complete before creating the next group of checks
when PlanChanged is called.
Each check runner will use a deferred exit to decrement the group counter.

checkstate: make tests more robust

The following changes are made:

Remove a check which depends on check output in TestTimeout as
there is no way to remove the race between the 'echo FOO'
completing before the timeout is triggered, without significant rework.

This failure can easily be reproduced by running the test under
heavy cpu load (i.e. stress --cpu 10 --io 4)

manager_test.go:140:
c.Assert(check.ErrorDetails, Equals, "FOO")
... obtained string = ""
... expected string = "FOO"
Make waitCheck() more conservative when waiting for a check
runner iteration to complete. The worse case timeout can
safely be much longer as this does not have a general impact
on test duration.

This failure can easily be reproduced by running the test under
heavy cpu load (i.e. stress --cpu 10 --io 4)

manager_test.go:134:
check := waitCheck(c, mgr, "chk1", func(check *CheckInfo) bool {
return check.Status != CheckStatusUp
})
manager_test.go:283:
c.Fatalf("timed out waiting for check %q", name)
... Error: timed out waiting for check "chk1"

The checks manager uses contexts to cancel check runners. This happens during checkstate.PlanChanged() when the previous checkers are all terminated, and the new checker definition is applied. However, context cancellation is asynchronous (as highlighted in the context documentation), so the check runners will continue to complete even after the context cancellations returned. Due to this race condition, the runner can still launch a check even after the context cancellation is complete. This problem is reproduced here: https://gist.github.com/flotter/f6bf32d7cc33cb1d53a3ddc39fc3b64c The result of this is easily seen the following unit test failure: panic: internal error: reaper must be started goroutine 75 [running]: github.com/canonical/pebble/internals/reaper.StartCommand /home/flotter/pebble/internals/reaper/reaper.go:164 github.com/canonical/pebble/internals/overlord/checkstate.(*execChecker).check() /home/flotter/pebble/internals/overlord/checkstate/checkers.go:170 github.com/canonical/pebble/internals/overlord/checkstate.(*checkData).runCheck() /home/flotter/pebble/internals/overlord/checkstate/manager.go:199 github.com/canonical/pebble/internals/overlord/checkstate.(*checkData).loop() /home/flotter/pebble/internals/overlord/checkstate/manager.go:182 created by github.com/canonical/pebble/internals/overlord/checkstate.(*CheckManager).PlanChanged /home/flotter/pebble/internals/overlord/checkstate/manager.go:74 In this particular case, the problem is that test teardown stops the reaper process which is responsible for reaping check commands, while the final (edge case) run of the check is still completing. Following the same logic, this could also cause check overlap between terminating previous checks and starting new checks in checkstate.PlanChanged(). The following changes fix the problem: - Use a WaitGroup to represent the group of active checks in the check manager - Wait() on the group to complete before creating the next group of checks when PlanChanged is called. - Each check runner will use a deferred exit to decrement the group counter.

The following changes are made: - Remove a check which depends on check output in TestTimeout as there is no way to remove the race between the 'echo FOO' completing before the timeout is triggered. This failure can easily be reproduced by running the test under heavy cpu load (i.e. stress --cpu 10 --io 4) manager_test.go:140: c.Assert(check.ErrorDetails, Equals, "FOO") ... obtained string = "" ... expected string = "FOO" - Make waitCheck() more conservative when waiting for a check runner iteration to complete. The worse case timeout can safely be much longer as this does not have a general impact on test duration. This failure can easily be reproduced by running the test under heavy cpu load (i.e. stress --cpu 10 --io 4) manager_test.go:134: check := waitCheck(c, mgr, "chk1", func(check *CheckInfo) bool { return check.Status != CheckStatusUp }) manager_test.go:283: c.Fatalf("timed out waiting for check %q", name) ... Error: timed out waiting for check "chk1"

flotter · 2023-06-27T11:32:12Z

internals/overlord/checkstate/manager_test.go

@@ -45,12 +45,14 @@ func (s *ManagerSuite) SetUpSuite(c *C) {
 	setLoggerOnce.Do(func() {
 		logger.SetLogger(logger.New(os.Stderr, "[test] "))
 	})
+}


Moved reaper.Start() and reaper.Stop() to test level. The command cleanup should happen before the next test starts, and it also provides a more robust test environment for race conditions, as now each test must cleanly exit before the reaper is stopped.

Note that the reaper is bound to the package level test binary process ID, and therefore moving it to the test setup and teardown make it incompatible for parallel testing. However, parallel testing is strictly opt-in, so this is still a valid requirement for this particular package tests.

flotter · 2023-06-27T11:33:17Z

internals/overlord/checkstate/manager_test.go

@@ -161,17 +162,15 @@ func (s *ManagerSuite) TestCheckCanceled(c *C) {
 		},
 	})

-	// Wait for command to start (output file grows in size)
-	prevSize := 0
+	// Wait for command to start (output file is not zero in size)


Previous logic did not make 100% sense, so I simplified it a bit.

c.MkDir() is test managed and removed after the test. The tempfile is therefore always zero to start with, and so we can simply check for a non-zero value. In the original code, the prevSize = len(b) appeared redundant as the previous check would break out before it could happen for any non-zero value.

benhoyt

Thank you! This is really good work and detailed analysis. Much appreciated!

In terms of review: approved, but with a couple of very small suggestions.

internals/overlord/checkstate/manager.go

internals/overlord/checkstate/manager_test.go

anpep

At last! Thanks for the great work

flotter · 2023-06-28T12:55:32Z

internals/overlord/checkstate/manager_test.go

@@ -137,7 +139,6 @@ func (s *ManagerSuite) TestTimeout(c *C) {
 	c.Assert(check.Failures, Equals, 1)
 	c.Assert(check.Threshold, Equals, 1)
 	c.Assert(check.LastError, Equals, "exec check timed out")
-	c.Assert(check.ErrorDetails, Equals, "FOO")


As explained in the commit message, this check is too racey. There is no way there is any guarantee this would be done before the timeout cuts off the command execution. Given that this test focuses on the timeout mechanism, and not command output logging, I feel it is justified to remove this to improve test robustness.

flotter · 2023-06-28T12:56:19Z

internals/overlord/checkstate/manager_test.go

@@ -185,7 +184,6 @@ func (s *ManagerSuite) TestCheckCanceled(c *C) {
 	stopChecks(c, mgr)

 	// Ensure command was terminated (output file didn't grow in size)
-	time.Sleep(50 * time.Millisecond)


stopChecks() calls PlanChanged(), which terminates the previous checks synchronously now, so this is not needed.

benhoyt · 2023-06-28T23:57:53Z

Small stability improvements, no API changes -- going to merge this without further review. We can always tweak in subsequent PRs if needed.

barrettj12 · 2023-07-05T00:32:51Z

@flotter I'm seeing a similar intermittent failure in the servstate tests - see #224. Maybe a similar fix could help here too?

flotter added 2 commits June 27, 2023 13:19

flotter requested review from benhoyt, anpep, barrettj12 and niemeyer June 27, 2023 11:25

flotter commented Jun 27, 2023

View reviewed changes

flotter mentioned this pull request Jun 27, 2023

Test race conditions (including cases only seen under heavy cpu load) #181

Closed

benhoyt approved these changes Jun 28, 2023

View reviewed changes

anpep approved these changes Jun 28, 2023

View reviewed changes

flotter removed the request for review from barrettj12 June 28, 2023 12:40

flotter added the High Priority Look at me first label Jun 28, 2023

flotter mentioned this pull request Jun 28, 2023

daemon: FakeCommand usage requires reaper.Start() #247

Merged

Code review changes canonical#1

c853715

flotter force-pushed the checkstate-race branch from 99d10fb to c853715 Compare June 28, 2023 12:42

flotter commented Jun 28, 2023

View reviewed changes

benhoyt approved these changes Jun 28, 2023

View reviewed changes

benhoyt merged commit 6892f25 into canonical:master Jun 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

checkstate: make tests more robust #249

checkstate: make tests more robust #249

flotter commented Jun 27, 2023 •

edited

Loading

flotter Jun 27, 2023

flotter Jun 27, 2023 •

edited

Loading

benhoyt left a comment

anpep left a comment

flotter Jun 28, 2023

flotter Jun 28, 2023

benhoyt commented Jun 28, 2023

barrettj12 commented Jul 5, 2023

checkstate: make tests more robust #249

checkstate: make tests more robust #249

Conversation

flotter commented Jun 27, 2023 • edited Loading

flotter Jun 27, 2023

Choose a reason for hiding this comment

flotter Jun 27, 2023 • edited Loading

Choose a reason for hiding this comment

benhoyt left a comment

Choose a reason for hiding this comment

anpep left a comment

Choose a reason for hiding this comment

flotter Jun 28, 2023

Choose a reason for hiding this comment

flotter Jun 28, 2023

Choose a reason for hiding this comment

benhoyt commented Jun 28, 2023

barrettj12 commented Jul 5, 2023

flotter commented Jun 27, 2023 •

edited

Loading

flotter Jun 27, 2023 •

edited

Loading